This section is still under construction and will be completed in the near future. Please do not go beyond this point for now.
In this section, we will have a look at a more realistic example of industrial data.
20.1 Setting the stage
Imagine a manufacturing chain dedicated to producing knives with wooden handles. Various machines are interconnected and the process flow is as follows:
- Machine A prepares the steel blade and shaft.
- Machine B applies epoxy to the grip area.
- Machine C inserts the wooden handle material.
- Machine D is a curing oven, hardening the adhesive.
- Machine E coats the knife with a protective finish and completes the manufacturing process.
Operators notice that the process is not running as smoothly as expected. The handle is not always properly bonded to the metal part and sometimes the wood is cracked. They find that the root cause of the issue lies in the epoxy application step of machine B. Common issues include excessive or insufficient epoxy application, uneven distribution, and occasional poor adhesion to the shaft. One clear economic impact is that these defective products cannot be sold. But there is more to it:
- Wasting material is not sustainable.
- The defect actually already happens in Machine B, but Machines C to E still process the faulty products, leading to further waste and costs, especially since the wooden handle is the most expensive part of the product and the curing oven is very energy-intensive.
Since there is a new department focused on data science, they decide to manually inspect the epoxy application right after machine B and label the data accordingly. Additionally, the machine collects some sensor data in a CSV file. This dataset is given to the data science team, and in tight collaboration with the operators the ambitious goal is to:
- Optimize the epoxy application process to reduce defects and waste.
- Since human inspection is time-consuming and expensive, find a way to predict defects occurring in Machine B before the product is handed over to Machine C.
Of course, this data science department is us.
20.2 Dataset
The dataset contains sensor readings from a simulated industrial process and consists of two files:
simulated_machine_data.csv stores the sensor data
simulated_inspection_data.csv stores the human inspection results
The sensor data includes the following features at a 200 ms resolution:
product_id: Unique identifier for each product.
process_step_index: Index indicating the current step within the process. The machine performs three processing steps: moving the nozzle to the shaft, applying epoxy, and blowing out residual epoxy from the nozzle while moving away.
timestamp: Timestamp of the sensor reading.
air_temperature_C: Ambient air temperature in degrees Celsius.
process_speed_mm_s: Set speed of the process in millimeters per second.
pressure_measured_bar: Measured pressure in bar.
pressure_setting_bar: Pressure setting in bar.
machine_temperature_C: Temperature of the machine in degrees Celsius.
The inspection data includes the product_id and the error_code:
product_id: Unique identifier for each product.
error_code: Code indicating the type of error detected during inspection.
The operators used OK when there was no defect, and NOK_1 (excessive epoxy), NOK_2 (insufficient epoxy), NOK_3 (poor adhesion), or NOK_4 (uneven distribution) otherwise.
20.3 Additional information
The operators state that the epoxy application process is highly sensitive to variations in environmental conditions, such as temperature and humidity. They believe that incorporating additional sensor data, such as humidity levels, could further enhance understanding of the process. Also, the pressure settings are critical, since they directly influence the amount of epoxy applied to the product. The first processing step ensures the correct positioning of the nozzle and the final processing step is supposed to clean the nozzle, preparing it for the next application.
20.4 Loading packages
import os
import pandas as pd  # used for data handling
import seaborn as sns  # used for statistical data visualization
import plotly.express as px  # used for performant plotting
import plotly.io as pio  # used to set the default plotly renderer

pio.renderers.default = "notebook"  # set the default plotly renderer to "notebook" (necessary for quarto to render the plots)
20.5 Loading the dataset
The dataset is available online in the form of two CSV files.
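The loading code itself is not shown here; a minimal sketch, assuming the two files from Section 20.2 are available locally (the file paths and the parse_dates argument are assumptions):

# Hypothetical local paths; adjust to where the CSV files are stored.
df_processes = pd.read_csv("simulated_machine_data.csv", parse_dates=["timestamp"])
df_inspections = pd.read_csv("simulated_inspection_data.csv")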
if df_processes.isna().sum().sum() > 0 or \
   df_inspections.isna().sum().sum() > 0 or \
   df_processes.duplicated().sum() > 0 or \
   df_inspections.duplicated().sum() > 0:
    raise ValueError("Missing values or duplicates found in the dataframes.")
else:
    print("No missing values or duplicates found in the dataframes.")
No missing values or duplicates found in the dataframes.
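The next observation rests on how often each error code occurs. A minimal sketch of how one might check this, assuming df_inspections holds one row per product as described in Section 20.2:

# Count how often each error code occurs in the inspection data.
print(df_inspections["error_code"].value_counts())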
NOK_2 and NOK_3 seem to occur only rarely. Let us mentally drop these error codes for this investigation, since <= 10 samples give a very weak statistical basis and tend to distort plots.
20.7 Exploring continuous variables by simple plots
Here, we visualize the continuous variables (one by one) against the timestamp to get an idea of their temporal behavior.
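The list of continuous variables used in the plot below is not defined in the snippet; a minimal sketch of what CONTINUOUS_VARS presumably contains, based on the feature descriptions in Section 20.2:

# Assumption: the continuous sensor channels described in Section 20.2.
CONTINUOUS_VARS = [
    "air_temperature_C",
    "process_speed_mm_s",
    "pressure_measured_bar",
    "pressure_setting_bar",
    "machine_temperature_C",
]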
df_filtered = df_processes.query("nr_iteration < 1000").sort_values("timestamp")  # for performance reasons, we filter to the first 1000 cycles.

fig = px.line(
    df_filtered,
    x="timestamp",
    y=CONTINUOUS_VARS,
    facet_col="variable",
    facet_col_wrap=1,
    title="Continuous Variables Over Time"
)
fig.update_xaxes(matches='x')   # share x-axis zoom/pan
fig.update_yaxes(matches=None)  # do not share y-axis limits
fig.update_layout(height=1200)  # increase figure height (y axis size)
fig.show()
The plot is interactive. Zoom in to see more details. Observe that the set pressure is constant over time (within a processing step), while the measured pressure varies around the set pressure and also shows some systematic offset.
Let us investigate this difference between set and measured pressure in more detail.
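The computation of this difference is not shown above; a minimal sketch, assuming it is defined as measured minus set pressure (the column name pressure_difference_bar matches its use further below), followed by a first histogram of absolute counts:

# Assumption: pressure difference = measured pressure - set pressure.
df_processes["pressure_difference_bar"] = (
    df_processes["pressure_measured_bar"] - df_processes["pressure_setting_bar"]
)

# First look: histogram of absolute counts, split by processing step.
sns.histplot(data=df_processes, x="pressure_difference_bar", hue="process_step_index", bins=50)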
It appears that the pressure difference is roughly normally distributed and that the mean is similar across the three processing steps.
Absolute counts are visually harder to compare, so let us use relative frequencies instead and normalize each of the processing steps separately.
sns.histplot(
    data=df_processes,
    x="pressure_difference_bar",
    hue="process_step_index",
    bins=50,
    common_norm=False,  # normalize each processing step separately
    stat="percent"      # show relative frequencies in percent
)
Apparently the pressure difference has a similar mean and standard deviation across the three processing steps. We can also show this by calculating the mean and standard deviation of the pressure difference for each processing step.
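A minimal sketch of this calculation, using the pressure_difference_bar column defined above:

# Mean and standard deviation of the pressure difference per processing step.
df_processes.groupby("process_step_index")["pressure_difference_bar"].agg(["mean", "std"])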
We conclude that the pressure is not perfectly controlled (or measured), with a systematic offset of 7 bar and a standard deviation of roughly 2 bar.
fig = px.line(
    df_processes.sample(10_000).sort_values("timestamp"),  # for performance reasons, we sample 10,000 rows. Always remember to sort by timestamp after sampling.
    x="timestamp",
    y=["machine_temperature_C", "air_temperature_C"],
    labels={"value": "Temperature (°C)", "timestamp": "Timestamp", "variable": "Type"},
    title="Machine and Air Temperature Over Time"
)
fig.show()
We observe that the machine is in a cold state when the data collection starts and heats up to a more stable plateau, where it still seems to follow the shape of the air temperature.
20.8 Aggregating plots
We are looking at iterations of the same process. Let us look at a plot where the process iterations overlap each other, colored by error code.
First, we create a new index which starts at 0 for each piece that is manufactured and counts each timestamp within that process iteration. This index will be useful for plotting the different process iterations on top of each other, as sketched below.
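This step is not shown in the original code; a minimal sketch, assuming each process iteration is identified by the nr_iteration column used in the plots below:

# Number the sensor readings within each process iteration, starting at 0.
df_processes["process_inner_index"] = df_processes.groupby("nr_iteration").cumcount()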
# Merge error_code from df_inspections into df_processes based on 'nr_iteration'. This assigns an error code to the entire process iteration.
df_plot = df_processes.merge(
    df_inspections,
    left_on="nr_iteration",
    right_index=True,
    how="left"
)

fig = px.line(
    df_plot,
    x="process_inner_index",
    y="pressure_measured_bar",
    line_group="nr_iteration",
    color="error_code",
    title="Pressure per Iteration (Overlapped, Colored by Error Code)",
    labels={
        "process_inner_index": "Inner-process timestamp",
        "pressure_measured_bar": "Pressure (bar)",
        "error_code": "Error Code"
    },
)

# Set opacity: 0.01 for 'OK', 0.5 for others.
# Setting opacity is always a bit tricky: setting it too low makes the lines almost invisible, setting it too high makes the plot too crowded.
# These settings were found by trial and error and work well for this case.
for trace in fig.data:
    if trace.name == "OK":
        trace.opacity = 0.01
    else:
        trace.opacity = 0.5

# Unselect NOK_2 and NOK_3 by default, since they are very rare and would clutter the plot.
for i, trace in enumerate(fig.data):
    if trace.name in ["NOK_2", "NOK_3"]:
        fig.data[i].visible = "legendonly"

fig.update_layout(showlegend=True)
fig.show()
Keep in mind that you can (de)select lines by clicking on the corresponding legend entry. Plotly also allows you to zoom in and pan around.
From the plot, we can see that the second processing step’s pressure is elevated for NOK_1. NOK_4 shows no obvious difference to OK.
With this in mind, let us have a closer look at the mean pressure and its standard deviation per error code and processing step.
# Aggregate: mean and std for each process_inner_index and error_code
agg_df = df_plot.groupby(['process_inner_index', 'error_code'])['pressure_measured_bar'].agg(['mean', 'std']).reset_index()

fig = px.line(
    agg_df,
    x="process_inner_index",
    y="mean",
    color="error_code",
    error_y="std",
    labels={
        "process_inner_index": "Inner-process timestamp",
        "mean": "Pressure (bar)",
        "error_code": "Error Code"
    },
    title="Mean Pressure per Error Code with Deviation"
)

# Unselect NOK_2 and NOK_3 by default
for i, trace in enumerate(fig.data):
    if trace.name in ["NOK_2", "NOK_3"]:
        fig.data[i].visible = "legendonly"

fig.update_layout(legend_title_text='Error Code')
fig.show()
This plot confirms our previous observation that NOK_1 has a systematically higher mean pressure during the second processing step and that NOK_4 does not differ notably from OK.
While the first line plot shows every individual process iteration as a single line and therefore does not accidentally filter out relevant information, it is a bit crowded and hard to read. The second plot aggregates the individual lines by calculating the mean and standard deviation, thereby losing some information, but making it easier to comprehend.
Now let us also check whether the machine temperature has an influence on failures.
df_agg = df_plot.groupby(['nr_iteration', 'error_code'])['machine_temperature_C'].agg(['mean']).reset_index()

fig = px.box(
    df_agg,
    x="error_code",
    y="mean",
    title="Machine Temperature by Error Code",
    labels={"mean": "Mean Machine Temperature (°C)", "error_code": "Error Code"}
)
fig.show()
Here as well, NOK_4 and NOK_1 do not show a notable difference to OK, while NOK_2 and NOK_3 cannot be judged well due to the low sample size.
20.9 Summary of findings
Let us summarize our findings so far in this dataset:
- The pressure during the second processing step is elevated for NOK_1 compared to OK.
- The machine temperature does not show a notable difference between OK and NOK_x.
- The source of NOK_4 failures was not identified in this analysis.
- The error codes NOK_2 and NOK_3 are too rare to draw any conclusions.